\rm
\magnification=1200

\def\today{\ifcase\month\or
  January\or February\or March\or April\or May\or June\or
  July\or August\or September\or October\or November\or December\fi
  \space\number\day, \number\year}

\line{\sevenrm 162B01.tex[v2,rwf] \today\hfill}

\vskip .225in

\noindent
CS162

\vskip .225in 
\noindent
An entropy problem

We have a large file on external memory to sort.  It contains $n$ records,
and has  original  entropy  $\lg(n!)$,  since  there  are  $n!$  different
permutations which might be required.  External memory is organized into
{\sl pages} of $p$ records each, which can be brought in or out only as
units; fast memory, where all comparisons and computations are done, has
room for only a few pages.  We therefore decide on this outline of an
algorithm.  We maintain two page-sized output buffers and one input buffer
in fast memory.  We read in data, then, by whatever means, put each record
into one of the output buffers.  When an output buffer is full, we return
its page to external memory.  The last time we refer to a page, we sort
it.
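
Below we approximate $\lg(n!)$ by $n\lg n$ and $\lg(p!)$ by $p\lg p$.
Both estimates follow from Stirling's approximation, which for any $m$
gives
$$\lg(m!)=m\lg m-m\lg e+O(\lg m),$$
so $\lg(m!)\approx m\lg m$ for large $m$.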

Each decision of which buffer to put a record in gains at most one bit  of
information.  The final within-page sorts gain at most $(n/p)\lg(p!)$
bits, or about $n\lg p$.  The total information gained must be at least
equal, on the average, to the entropy $\lg(n!)$, or about $n\lg n$, of the
problem, so the number of times a record is placed in a buffer must be at
least $n\lg n-n\lg p=n\lg(n/p)$; if we organize the algorithm in passes,
each of which sees each record once, the number of passes must be at least
$n\lg(n/p)/n=\lg(n/p)$.  Thus any algorithm we can program to work in
$\lg(n/p)$ passes is close to the maximum achievable efficiency.
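
For a concrete (purely illustrative) instance, take $n=10^6$ records and
$p=10^3$ records per page.  Then the bound is
$$\lg(n/p)=\lg 10^3\approx 10\hbox{ passes},$$
while a method gaining its information from one-bit buffer decisions
alone, with no final within-page sorts, would need about
$\lg n\approx 20$ passes.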

\vfill \end